Abstract

This paper describes a system-level approach to improving the
area and delay of datapath designs that perform polynomial
computations over $Z_{2^m}$, which are used in many applications
in domains such as computer graphics and digital signal
processing. The approach reduces the number of arithmetic
operations needed to implement a multivariate polynomial
system by optimizing at the system level, prior to high-level
synthesis. It relies on univariate functional decomposition of
polynomial expressions and on the canonical form over $Z_{2^m}$.
We use the GAUT high-level synthesis tool to generate RTL
datapath architectures for the optimized polynomials.
Experimental results on a set of benchmark applications with
polynomial expressions show that this method outperforms
conventional methods in the area of the sequential datapath
architectures in speed optimization mode, with an average
improvement of 25.81%, and in the required clock cycles in
both speed optimization and area optimization modes, with
average improvements of 23.48% and 38.24%, respectively.
1. Introduction

As the complexity and size of modern embedded applications
continue to increase, designing hardware at higher levels of
abstraction, for faster design adjustments and higher
simulation speed, becomes necessary. Conventional high-level
synthesis techniques are not efficient at eliminating
redundancy and common sub-expressions in polynomial
datapaths over $Z_{2^m}$. Such polynomial functions have
therefore been optimized manually to achieve an efficient
register-transfer-level (RTL) implementation, a process that
can be time consuming and error prone. Hence, high-level
synthesis and optimization techniques that automate the
design of custom polynomial datapaths from a behavioral
description are desirable.

The Horner form of a polynomial expression is a normal
form representation using a nested format. This method
transforms the expression into a sequence of nested additions
and multiplications, which is suitable for univariate
polynomials and for sequential machine evaluation using
multiplier-accumulator units.

Another algebraic technique is based on kernel/co-kernel
computation [6]. First, the lowest-cost form of each given
polynomial is chosen among its canonical, square-free
factorized, and original forms. Then common coefficients and
common cubes are extracted using the kernel/co-kernel
extraction technique from [7], and common sub-expressions
are determined using algebraic division. This method is only
applicable to polynomials in which linear blocks exist
explicitly.

In [7] and [8], a factoring method was proposed that
employs kernel/co-kernel extraction with common
sub-expression elimination to reduce the size of the
implementation. The approximate factorization algorithm
presented in [8] represents an arithmetic function $f$ as a
product of sub-functions $f = f_1 \times f_2 \times \dots \times f_n$,
where $f$ is a multivariate polynomial. However, this algorithm
can only factorize square-free polynomials and cannot deal
with a sub-function $f_i$ of degree higher than one.

Another algebraic method was proposed in [3] and then
improved in [4]. The main idea is similar to the algebraic
division techniques used in logic synthesis. This technique
tries to decompose the original polynomial $poly$ as
$poly = p_1 \times p_2 + p_3$ such that $p_3$ is minimized. To do
so, all possible initial values of $p_1$ and $p_2$ must be
evaluated. Then, for each initialization, it is necessary to
check whether other monomials in $poly$ can be represented in
the form $p_1 \times p_2$. Finally, the best initialization, i.e., the
one that yields the lowest-complexity $p_3$, is chosen. The
algebraic technique in [2] improves the optimization
heuristics in [3] and [4] to extract more common
sub-expressions by considering single-variable and hidden
monomials. This technique makes use of finite ring algebra
and the Modular Horner Expansion Diagram [5]. It first
reduces the original polynomials over $Z_{2^m}$; then common
sub-expressions are extracted based on two heuristics. The
main disadvantage of this technique is that decomposition
starts from the reduced polynomials, whereas starting from
the original polynomials would allow more common
sub-expressions to be extracted.

The algebraic method in [1] proposed, for the first time, a
polynomial optimization technique based on redundancy
addition/removal. The main idea is similar to the logic
optimization based on redundancy addition/removal that has
been developed in logic synthesis. In this method,
kernels/co-kernels of the given polynomials are first
extracted as good building blocks. Then a large number of
vanishing polynomials over $Z_{2^m}$, i.e., polynomials equal to 0
over $Z_{2^m}$, are generated as redundancy in order to transform
the given polynomials in such a way that more common
sub-expressions can be extracted. Finally, common
sub-expressions are determined using algebraic division.

In the current paper, we introduce system-level techniques
for transforming the given system of polynomials so that it
offers more common sub-expressions. Our optimization method
reduces the complexity of polynomial datapaths in terms of
the number of arithmetic operations by performing
optimization at the system level, prior to high-level
synthesis. Furthermore, to generate the RTL datapath
architecture for the optimized polynomials, we use the GAUT
high-level synthesis tool [12], although any other high-level
synthesis tool could be used. Our optimization method reduces
the area and the number of clock cycles of the RTL datapath
architectures. In this method, we use the mathematical
concept of univariate functional decomposition of polynomial
expressions in order to obtain good building blocks and hence
extract more common sub-expressions.

In summary, our design flow in this paper consists of the 
following tasks: 

• System-level transformations to optimize datapath
designs that perform polynomial computations over $Z_{2^m}$
using univariate functional decomposition and the
canonical form.

• Univariate functional decomposition of the given 
polynomials to obtain good building blocks and extract 
suitable common sub-expressions. 

• High-level synthesis using GAUT [12] to generate 
datapath architectures for the optimized polynomials as 
sequential circuits. 

• Evaluating the performance of the proposed method and 
showing its effectiveness by comparing it with the state- 
of-the-art polynomial optimization methods in the 
literature. 

The remainder of this paper is organized as follows. 
Section 2 introduces some preliminaries which are used in 
the rest of the paper. A motivational example is presented in 
section 3. Section 4 explains, in detail, our proposed 
polynomial optimization method. Section 5 evaluates the 
performance of our algorithms and presents experimental 
results that demonstrate their effectiveness. Finally, section 6 
provides our conclusion. 



2. Preliminaries 

This section introduces some preliminaries which are
used in the rest of the paper. In this paper, arithmetic
datapaths are modeled as polynomial functions from
$Z_{2^{n_1}} \times Z_{2^{n_2}} \times \dots \times Z_{2^{n_d}}$ to $Z_{2^m}$ [9]. Let
$f_1(x), \dots, f_p(x)$ be $p$ given polynomial functions from
$Z_{2^{n_1}} \times Z_{2^{n_2}} \times \dots \times Z_{2^{n_d}}$ to $Z_{2^m}$ as the
specification, where $x = \langle x_1, x_2, \dots, x_d \rangle$ is a vector of
$d$ input variables and $n_1, n_2, \dots, n_d$ denote the sizes of the
corresponding variables. $Z_{2^n}$ represents the finite set of
integers $\{0, 1, \dots, 2^n - 1\}$, and $m$ is the size of the output
bit-vector $f$.

Theorem 1: Let $f$ be a polynomial function from
$Z_{2^{n_1}} \times \dots \times Z_{2^{n_d}}$ to $Z_{2^m}$. Then according to [9], $f$
can be uniquely represented in a canonical form as (1), where
$Y_k$ is the falling factorial of degree $k \in Z$ ($Z$ denotes the
ring of integers) and is defined as follows,

$Y_0(x) = 1$, $Y_1(x) = x$, $Y_2(x) = x \times (x-1)$, ...,
$Y_k(x) = Y_{k-1}(x) \times (x-k+1)$.

$f = \sum_K a_K Y_K = \sum_K a_K \times Y_{k_1}(x_1) \times \dots \times Y_{k_d}(x_d)$   (1)

$a_K$ is an integer such that
$0 \le a_K < 2^m / \gcd(2^m, \prod_{i=1}^{d} k_i!)$,
$K = \langle k_1, k_2, \dots, k_d \rangle$ for each $k_i = 0, 1, \dots, \mu_i - 1$, and
$\mu_i = \min\{2^{n_i}, SF(2^m)\}$. $SF(n)$ is the least $k \in N$ such
that $n$ divides $k!$, and denotes the Smarandache function [10].
$\gcd(x, y)$ computes the greatest common divisor of $x$ and $y$.

For example, let $f = 2x^5 + x^4 + x^2 - 2x$; the canonical form of
$f$ over $Z_{2^3}$ is $2x^2$. Note that the canonical form of a
polynomial from $Z_{2^{n_1}} \times Z_{2^{n_2}} \times \dots \times Z_{2^{n_d}}$ to $Z_{2^m}$
may be zero.
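
The example above can be checked by brute force, since two polynomials have the same canonical form over $Z_{2^m}$ exactly when they agree as functions on every input. The following is a minimal sketch, assuming a 3-bit input and a 3-bit output; the helper names `smarandache`, `eval_poly`, and `same_function` are illustrative and do not come from [9] or [10].

```python
from math import factorial

def smarandache(n: int) -> int:
    """Least k >= 1 such that n divides k! (SF(n) in Theorem 1)."""
    k = 1
    while factorial(k) % n != 0:
        k += 1
    return k

def eval_poly(coeffs, x, modulus):
    """Evaluate a polynomial (coefficients low-to-high) modulo `modulus` by Horner's rule."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % modulus
    return acc

def same_function(p, q, n_bits, m_bits):
    """True if p and q agree on every input of Z_{2^n}, with outputs in Z_{2^m}."""
    modulus = 2 ** m_bits
    return all(eval_poly(p, x, modulus) == eval_poly(q, x, modulus)
               for x in range(2 ** n_bits))

f = [0, -2, 1, 0, 1, 2]   # 2x^5 + x^4 + x^2 - 2x
canon = [0, 0, 2]         # 2x^2

print(smarandache(2 ** 3))             # degree bound SF(2^3) used for mu_i
print(same_function(f, canon, 3, 3))   # do f and 2x^2 coincide over Z_{2^3}?
```

Exhaustive comparison is only practical for small bit-widths; it is used here purely as a sanity check, not as a way to compute the canonical form.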

Definition 1: If $g$ and $h$ are univariate polynomials, then
the univariate polynomial $f(x) = g(x) \circ h(x)$ is their
functional composition, and $(g, h)$ is a univariate functional
decomposition of $f$, where $g$ and $h$ are polynomials of lower
degree than $f$ and are called the left decomposition factor
and the right decomposition factor of $f$, respectively. $\circ$ is
the composition operator, which computes the output of $g$
when its argument is $h(x)$ instead of $x$ (i.e.,
$f(x) = g(x) \circ h(x) = g(h(x))$).

Example 1: Let $f(x) = x^4 + x^2 - 3$; then
$f(x) = g(x) \circ h(x) = (x^2 + x - 3) \circ x^2$ is a univariate
functional decomposition of $f$, where $g(x) = x^2 + x - 3$ and
$h(x) = x^2$.
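
Functional composition as used in Definition 1 can be sketched on coefficient lists. The block below is an illustrative assumption about representation (coefficients stored low-to-high, helper names `poly_mul`, `poly_add`, `trim`, `compose` are hypothetical); it reproduces Example 1.

```python
def poly_mul(a, b):
    """Product of two polynomials given as coefficient lists (low-to-high)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def poly_add(a, b):
    """Sum of two polynomials of possibly different lengths."""
    n = max(len(a), len(b))
    return [(a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0)
            for i in range(n)]

def trim(p):
    """Drop trailing zero coefficients, keeping at least the constant term."""
    while len(p) > 1 and p[-1] == 0:
        p = p[:-1]
    return p

def compose(g, h):
    """Coefficients of g(h(x)), evaluated with Horner's rule on g."""
    result = [0]
    for c in reversed(g):
        result = poly_add(poly_mul(result, h), [c])
    return trim(result)

g = [-3, 1, 1]   # x^2 + x - 3
h = [0, 0, 1]    # x^2
print(compose(g, h))   # coefficients of x^4 + x^2 - 3, low-to-high
```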



3. Motivational Example 

In this section, we present an example to motivate the 
optimization technique to be presented. In order to 
demonstrate the effectiveness of the proposed method, let us 
consider the following polynomial system. 

$f_1(x) = x^4 + 2x^3 + x^2 + xy^3 - 3xy^2 + 2xy$

$f_2(x) = x^6 + 3x^5 + 3x^4 + x^3 + x^2 + x$.

This system needs 32 multiplications and 10 additions.

After applying the factorization technique using
MATLAB [11] to these polynomials, $f_1$ and $f_2$ are
transformed to the following forms

$f_1(x) = x(x^3 + 2x^2 + x + y^3 - 3y^2 + 2y)$

$f_2(x) = x(x+1)(x^4 + 2x^3 + x^2 + 1)$,

which need 19 multiplications and 9 additions.

By applying our proposed optimization method over $Z_{2^2}$
to the original polynomials, $f_1$ is converted to the following
form,

$h(x) = x^2 + x$, $g(x) = x^2$

$f_1(x) = g \circ h + xy^3 - 3xy^2 + 2xy = x^2 \circ (x^2 + x) + xy^3 - 3xy^2 + 2xy$,
because the canonical form of $xy^3 - 3xy^2 + 2xy$ over $Z_{2^2}$ is 0,
$f_1(x) = x^2 \circ (x^2 + x) = (x^2 + x)^2 = h^2$,
and $f_2$ is converted as follows.
$h(x) = x^2 + x$, $g(x) = x^3 + x$

$f_2(x) = g \circ h = (x^3 + x) \circ (x^2 + x) = (x^2 + x)^3 + x^2 + x = h^3 + h$
$= h(h^2 + 1) = h(f_1 + 1)$.

Figure 1: (a) Datapath architecture of the polynomials, implemented using factorization, (b) Datapath architecture of the
polynomials, implemented using our proposed method

The optimized polynomial system requires only 3 
multiplications and 2 additions. We have used GAUT as a 
high-level synthesis tool to generate datapath architectures 
for the polynomial systems. GAUT tool has been used in 
many academic projects, and its HLS algorithms for binding, 
allocation, and scheduling are well documented [12]. 
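
The operation count of the optimized system can be read off a straight-line evaluation. The sketch below is an illustration only: it drops the $y$-dependent terms of $f_1$, following the statement above that their canonical form over $Z_{2^2}$ is 0, and compares the shared-subexpression form against the expanded $x$-only polynomials over the integers. The function names are hypothetical.

```python
def optimized(x):
    """Shared-subexpression evaluation: 3 multiplications, 2 additions in total."""
    h = x * x + x        # 1 multiplication, 1 addition
    f1 = h * h           # 1 multiplication
    f2 = h * (f1 + 1)    # 1 multiplication, 1 addition
    return f1, f2

def expanded(x):
    """Expanded forms; f1 is shown without its y-dependent terms (an assumption stated above)."""
    f1_x_part = x**4 + 2 * x**3 + x**2
    f2 = x**6 + 3 * x**5 + 3 * x**4 + x**3 + x**2 + x
    return f1_x_part, f2

for x in range(-5, 6):
    assert optimized(x) == expanded(x)
print("optimized and expanded forms agree on the sampled inputs")
```

Because $f_2$ reuses both $h$ and $f_1$, its only extra cost is one addition and one multiplication, which is where the saving over the factorized form comes from.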

We have used GAUT to generate datapath architectures
in two modes: speed optimization, and area optimization, in
which only one functional unit is considered for each
operation type existing in the design. The datapath
architecture of the polynomials implemented using
factorization, in the speed optimization mode, is shown in
Fig. 1(a). The datapath architecture of the polynomials
implemented using our proposed method is shown in Fig.
1(b).

The results reported by GAUT for the polynomials
implemented using factorization and using our proposed
method are shown in Table 1. We have used the “notch”
library provided by GAUT, and we have set the clock cycle to
20. This table reports the area and the number of clock
cycles, registers, multiplexers, and functional units (adder,
subtracter, multiplier) in the datapath architectures of the
factored polynomials and of the polynomials optimized using
our proposed method, in speed optimization and area
optimization modes.

Table 1. GAUT report for the polynomials implemented
using factorization and for the polynomials implemented
using our proposed method, in speed optimization and area
optimization modes.
4. Proposed System-level Optimization Method

We introduce system-level techniques for transforming the
given system of polynomials so that it offers more common
sub-expressions. Our optimization method reduces the
complexity of polynomial datapaths in terms of the number of
arithmetic operations by performing optimization at the
system level, prior to high-level synthesis. Furthermore, to
generate the datapath architecture for the optimized
polynomials as sequential circuits, we use the GAUT
high-level synthesis tool. Our optimization method reduces
the area and the number of clock cycles in the datapath
architectures.

In the first phase of the proposed system-level
optimization method, each given multivariate polynomial
$f(x_1, \dots, x_d)$ is transformed to several univariate
polynomials by representing $f$ based on each input variable
$x_i$ ($1 \le i \le d$). Then each obtained univariate polynomial is
decomposed through the univariate functional decomposition
algorithm explained in subsection 4.1 in order to obtain good
building blocks. In the second phase, to extract common
sub-expressions among the given polynomials, we make use of
the univariate functional decomposition algorithm, unlike
other works that use the algebraic division technique
[1][6][7]. Finally, among the various forms of the polynomials
in terms of the extracted common sub-expressions, the form
with the smallest number of arithmetic operations is
selected. These phases are explained in more detail in the
following subsections.



4.1. Determining Building Blocks (Phase I) 

In this phase, each given multivariate polynomial $f$ is
transformed to several univariate polynomials by
representing $f$ based on each input variable. Then each
obtained univariate polynomial is decomposed through the
univariate functional decomposition algorithm in order to
obtain good building blocks. This phase is explained in the
following steps.

Step 1: Each given multivariate polynomial $f(x_1, \dots, x_d)$ is
rewritten based on each input variable $x_i$ ($1 \le i \le d$) as (2),

$f = \sum_{e_1,\dots,e_{i-1},e_{i+1},\dots,e_d \ge 0} f_{e_1,\dots,e_{i-1},e_{i+1},\dots,e_d}(x_i) \times x_1^{e_1} \times \dots \times x_{i-1}^{e_{i-1}} \times x_{i+1}^{e_{i+1}} \times \dots \times x_d^{e_d}$   (2)

where $f_{e_1,\dots,e_{i-1},e_{i+1},\dots,e_d}(x_i)$ is a univariate polynomial
which represents the polynomial $f$ based on the variable $x_i$,
and $e_1, \dots, e_{i-1}, e_{i+1}, \dots, e_d$ are the degrees of the $d-1$
variables $x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_d$ in the polynomial $f$.

After applying this transformation to all given
polynomials $f_j$ ($1 \le j \le p$), where $p$ is the number of given
polynomials, all obtained univariate polynomials
$f_{e_1,\dots,e_{i-1},e_{i+1},\dots,e_d}(x_i)$ ($1 \le i \le d$) from all $f_j$ are
stored in a set named $AllSubPoly_{x_i}$ ($1 \le i \le d$).



Example 2: Suppose
$f_1(x,y) = x^6y^2 + 5x^6y + 2x^5y^2 + 10x^5y - x^4y^2 - 5x^4y + x^3y^4 - x^3y^2 - 5x^3y + 2x^2y^4 + 2x^2y^2 + 10x^2y - xy^2 - 5xy$, and

$f_2(x,y) = x^6y^3 + x^4y^4 - 2x^4y^3 + 2x^4y^2 - 2x^4y + x^2y^3 + xy^4 + 2xy^2 - 2xy$.

Then $f_1$ based on the variable $y$ is represented as

$f_1 = x^6(y^2+5y) + x^5(2y^2+10y) + x^4(-y^2-5y) + x^3(y^4-y^2-5y) + x^2(2y^4+2y^2+10y) + x(-y^2-5y)$,

so $AllSubPoly_y$ = { $f_1(y) = -y^2-5y$, $f_2(y) = 2y^4+2y^2+10y$,
$f_3(y) = y^4-y^2-5y$, $f_4(y) = -y^2-5y$, $f_5(y) = 2y^2+10y$, $f_6(y) = y^2+5y$ }.

And $f_1$ based on the variable $x$ is represented as

$f_1 = y^4(x^3+2x^2) + y^2(x^6+2x^5-x^4-x^3+2x^2-x) + y(5x^6+10x^5-5x^4-5x^3+10x^2-5x)$,

so $AllSubPoly_x$ = { $f_1(x) = 5x^6+10x^5-5x^4-5x^3+10x^2-5x$,
$f_2(x) = x^6+2x^5-x^4-x^3+2x^2-x$, $f_4(x) = x^3+2x^2$ }.

$f_2$ based on the variable $y$ is represented as

$f_2 = x^6(y^3) + x^4(y^4-2y^3+2y^2-2y) + x^2(y^3) + x(y^4+2y^2-2y)$,

so $AllSubPoly_y = AllSubPoly_y \cup \{y^3,\ y^4-2y^3+2y^2-2y,\ y^4+2y^2-2y\}$.

And $f_2$ based on the variable $x$ is represented as

$f_2 = y^4(x^4+x) + y^3(x^6-2x^4+x^2) + y^2(2x^4+2x) + y(-2x^4-2x)$,

so $AllSubPoly_x = AllSubPoly_x \cup \{x^4+x,\ x^6-2x^4+x^2,\ 2x^4+2x,\ -2x^4-2x\}$.
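
Step 1 amounts to grouping the monomials of a multivariate polynomial by the exponents of all variables except one. The following is a minimal sketch, assuming a sparse dictionary representation (exponent tuple to coefficient); the function name `collect_by` is hypothetical, and the exponents of $f_1$ are taken from its grouped forms in Example 2.

```python
from collections import defaultdict

def collect_by(poly, var_index):
    """Group monomials by the exponents of every variable except `var_index`.

    poly maps exponent tuples to integer coefficients. The result maps the
    remaining exponents to a univariate polynomial {exponent: coefficient}
    in the selected variable, i.e. one member of AllSubPoly per key.
    """
    groups = defaultdict(dict)
    for exps, coeff in poly.items():
        others = exps[:var_index] + exps[var_index + 1:]
        e = exps[var_index]
        groups[others][e] = groups[others].get(e, 0) + coeff
    return dict(groups)

# f_1(x, y) of Example 2; keys are (exponent of x, exponent of y).
f1 = {
    (6, 2): 1, (6, 1): 5, (5, 2): 2, (5, 1): 10, (4, 2): -1, (4, 1): -5,
    (3, 4): 1, (3, 2): -1, (3, 1): -5, (2, 4): 2, (2, 2): 2, (2, 1): 10,
    (1, 2): -1, (1, 1): -5,
}

by_x = collect_by(f1, 0)   # univariate polynomials in x, one per power of y
by_y = collect_by(f1, 1)   # univariate polynomials in y, one per power of x

print(by_x[(4,)])   # polynomial in x multiplying y^4
print(by_y[(6,)])   # polynomial in y multiplying x^6
```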

Step 2: A univariate functional decomposition is computed
for each member of $AllSubPoly_{x_i}$ ($1 \le i \le d$) (i.e.,
$f_{e_1,\dots,e_{i-1},e_{i+1},\dots,e_d}(x_i)$) by using the univariate
functional decomposition algorithm explained below.

Univariate functional decomposition algorithm: Let $g$
and $h$ be polynomials of degrees $r$ and $s$ over a field. Their
functional composition $f = g \circ h = g(h)$ has degree $n = r \times s$.
The univariate functional decomposition problem can be
stated as follows: given $f$ of degree $n = r \times s$, determine
whether such $g$ and $h$ exist, and in the affirmative case,
compute them [13].

The pseudo-code of the univariate functional
decomposition algorithm [14], which is slightly modified in
our method to also calculate the indecomposable part of an
input polynomial, is shown in Fig. 2. For every pair of values
$r$ and $s$ for which $r \times s = n$, the UniDec procedure in Fig. 2,
with $f$ and $r$ as inputs, calculates a univariate functional
decomposition of $f$ as (3), where $f_0$ is the indecomposable
part of $f$.

$f(x) = g(x) \circ h(x) + f_0 = g(h(x)) + f_0$   (3)

As explained in [14], $f$, $h$ and $g$ are in the following
forms, respectively:
$f = x^{rs} + a_{rs-1}x^{rs-1} + \dots + a_0$,
$h = x^s + c_{s-1}x^{s-1} + \dots + c_1x$,
$g = x^r + b_{r-1}x^{r-1} + \dots + b_0$.
In this algorithm, first, the coefficients of $h$, i.e.,
$(c_1, \dots, c_{s-1})$, are calculated from the coefficients of $f$ by
the h_UniDec procedure (lines 4-11 in Fig. 2). For this
purpose, the polynomial $q_k$ is defined as follows:

$q_k = x^s + c_{s-1}x^{s-1} + \dots + c_{s-k}x^{s-k}$, $0 \le k \le s$.

Then $q_0 = x^s$, $q_s = q_{s-1} = h$, and $q_k = q_{k-1} + c_{s-k}x^{s-k}$, $1 \le k \le s$.

According to [14], we can calculate the first $k+1$
coefficients of $h^r$ from the coefficients of $q_k$. The $(k+1)$st
coefficient of $q_k^r$ is the coefficient of $x^{rs-k}$; this agrees
with $a_{rs-k}$, i.e., the $(k+1)$st coefficient of $f$, $1 \le k \le s-1$.
Thus if the earlier coefficients $c_{s-1}, \dots, c_{s-k+1}$ of $h$ are
known, then $c_{s-k}$ can be determined by computing

$c_{s-k} = (a_{rs-k} - d_k) / r$,

where $d_k$ is the coefficient of $x^{rs-k}$ in $q_{k-1}^r$ [14].

Second, from $f$ and $h$, the coefficients of $g$, i.e.,
$(b_0, \dots, b_{r-1})$, are calculated by the g_UniDec procedure
(lines 12-16 in Fig. 2). Let $A[i,j]$ be the coefficient of $x^{is}$
in $h^j$, $0 \le i, j \le r$. Then $b = (b_0, \dots, b_{r-1})$ can be
determined by solving the following equation:

$Ab = a$,

where $a = (a_0, a_s, \dots, a_{rs})$ are coefficients of $f$.

Then, the composition of $h$ and $g$ is computed by using the
function subs, which is a library function of Maple [15] and
computes the value of $g \circ h$. The difference between $f$ and
$g \circ h$ is considered the indecomposable part of $f$ and is
referred to as $f_0$ (line 15 in Fig. 2).



UniDec(f::polynom, r::integer)
 1: h := h_UniDec(f, r);
 2: (g, f_0) := g_UniDec(f, h, r);
 3: return (h, g, f_0);

h_UniDec(f::polynom, r::integer)
 4: q_0^i := x^(i*(degree(f)/r)), 0 < i <= r;
 5: for k from 1 to (degree(f)/r) - 1
 6:    d_k := coef(q_(k-1)^r, x^(degree(f)-k));
 7:    c_((degree(f)/r)-k) := (a_(degree(f)-k) - d_k)/r;
 8:    q_k^1 := q_(k-1)^1 + c_((degree(f)/r)-k) * x^((degree(f)/r)-k);
 9:    q_k^j := sum_{i=0..j} binomial(j,i) * (c_((degree(f)/r)-k))^i
                * x^(i*((degree(f)/r)-k)) * q_(k-1)^(j-i), 1 < j <= r;
10: h := sum_{i=1..(degree(f)/r)} c_i * x^i;
11: return h;

g_UniDec(f::polynom, h::polynom, r::integer)
12: A(i+1, j+1) := coef(h^j, x^(i*(degree(f)/r))), 0 <= i, j <= r;
13: b := LinearSolve(A, a);
14: g := sum_{i=0..r} b_i * x^i;
15: f_0 := f - subs({x = h}, g);
16: return (g, f_0);



Figure 2: Univariate functional decomposition algorithm 
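
The procedure of Fig. 2 can be sketched in plain Python. This is an illustrative reimplementation under stated assumptions, not the Maple code used in the paper: $f$ is assumed monic with $\deg f$ divisible by $r$, polynomials are coefficient lists stored low-to-high, exact rational arithmetic comes from `fractions.Fraction`, and the linear system $Ab = a$ is solved by back substitution because $A$ is triangular with a unit diagonal when $h$ is monic. All function names are hypothetical.

```python
from fractions import Fraction

def mul(a, b):
    """Product of two coefficient lists (low-to-high)."""
    out = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def power(p, e):
    """p raised to the non-negative integer power e."""
    out = [Fraction(1)]
    for _ in range(e):
        out = mul(out, p)
    return out

def coef(p, k):
    """Coefficient of x^k in p (zero beyond the stored degree)."""
    return p[k] if 0 <= k < len(p) else Fraction(0)

def h_unidec(f, r):
    """Right factor h: monic, degree s = deg(f)/r, zero constant term."""
    n = len(f) - 1
    s = n // r
    q = [Fraction(0)] * s + [Fraction(1)]     # q_0 = x^s
    for k in range(1, s):
        d_k = coef(power(q, r), n - k)        # coefficient of x^(rs-k) in q_{k-1}^r
        q[s - k] = (f[n - k] - d_k) / r       # c_{s-k}
    return q

def g_unidec(f, h, r):
    """Left factor g from A b = a, and the indecomposable part f0 = f - g(h)."""
    s = len(h) - 1
    powers = [power(h, j) for j in range(r + 1)]
    b = [Fraction(0)] * (r + 1)
    for i in range(r, -1, -1):                # A[i][i] = 1, A[i][j] = 0 for j < i
        known = sum(b[j] * coef(powers[j], i * s) for j in range(i + 1, r + 1))
        b[i] = coef(f, i * s) - known
    size = max(len(f), r * s + 1)
    f0 = [coef(f, t) for t in range(size)]
    for j in range(r + 1):
        for t, c in enumerate(powers[j]):
            f0[t] -= b[j] * c
    return b, f0

def uni_dec(f, r):
    h = h_unidec(f, r)
    g, f0 = g_unidec(f, h, r)
    return h, g, f0

# f = x^6 + 2x^5 - x^4 - x^3 + 2x^2 - x, decomposed with r = 2 (so s = 3)
f = [0, -1, 2, -1, -1, 2, 1]
h, g, f0 = uni_dec(f, 2)
print(h, g, f0)
```

For this input the sketch returns $h = x^3 + x^2 - x$, $g = x^2 + x$ and a zero $f_0$, which can be confirmed by expanding $h^2 + h$.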



Example 3: Let us consider a member of $AllSubPoly_x$ in
Example 2, $f = x^6+2x^5-x^4-x^3+2x^2-x$. Because $n = 6$, one of
the cases with $r, s > 1$ is $r = 2$, $s = 3$. So $h$ and $g$ are in
the following forms.

$h = x^3 + c_2x^2 + c_1x$, $g = x^2 + b_1x + b_0$.

The steps of the univariate decomposition algorithm are as
follows.

$q_0 = x^3$, $q_0^2 = x^6$
step 1: (k=1)
$d_1 = coef(q_0^2, x^{6-1}) = 0$, $c_2 = (a_5 - d_1)/2 = 1$,
$q_1 = x^3 + x^2$, $q_1^2 = x^6 + 2x^5 + x^4$
step 2: (k=2)
$d_2 = coef(q_1^2, x^{6-2}) = 1$, $c_1 = (a_4 - d_2)/2 = -1$,
$q_2 = x^3 + x^2 - x$,
$q_2^2 = x^6 + 2x^5 - x^4 - 2x^3 + x^2$
So $h$ is obtained as $h = x^3 + x^2 - x$.

Then by using the coefficients of $f$ and $h$, coefficients of $g$ are
calculated as follows. With $A[i,j]$ the coefficient of $x^{3i}$ in
$h^j$ and $a = (a_0, a_3, a_6) = (0, -1, 1)$, solving $Ab = a$ gives
$b_0 = 0$ and $b_1 = 1$.
So $g$ is obtained as $g = x^2 + x$.



The general form resulting from applying the univariate
functional decomposition algorithm to each member of
$AllSubPoly_{x_i}$ ($1 \le i \le d$) is shown in (4).

$f_{e_1,\dots,e_{i-1},e_{i+1},\dots,e_d}(x_i) = g(x_i) \circ h(x_i) + f_0(x_i)$,
$f_{e_1,\dots,e_{i-1},e_{i+1},\dots,e_d}(x_i) \in AllSubPoly_{x_i}$, ($1 \le i \le d$)   (4)

Step 3: In the third step of the first phase of the proposed
method, all obtained right decomposition factors $h(x_i)$ of all
members of $AllSubPoly_{x_i}$ are stored in a set named $h\_set_{x_i}$
as good building blocks.

Example 4: Let us consider Example 2 again. By
computing the univariate functional decomposition of all
members of $AllSubPoly_x$ and $AllSubPoly_y$, $h\_set_x$ for
variable $x$ and $h\_set_y$ for variable $y$ are obtained as follows.

$h\_set_x = \{x^3+x^2-x,\ x^2+x,\ x^3-x,\ x^2\}$,
$h\_set_y = \{y^2,\ y^2-5y\}$.

4.2. Common sub-expression extraction (Phase II) 

The aim of the second phase of the proposed method is to
extract common sub-expressions among all given
multivariate polynomials, which is equivalent to extracting
common sub-expressions among their equivalent univariate
polynomials stored in $AllSubPoly_{x_i}$ ($1 \le i \le d$).

To extract common sub-expressions we make use of the
univariate functional decomposition algorithm, unlike other
works that use the algebraic division technique [1][6][7]. By
considering the members of $h\_set_{x_i}$ as good building blocks,
we try to re-decompose all members of $AllSubPoly_{x_i}$ by these
building blocks and find common sub-expressions among
them.

By using the g_UniDec procedure described in subsection
4.1, each $f_{e_1,\dots,e_{i-1},e_{i+1},\dots,e_d}(x_i) \in AllSubPoly_{x_i}$ is
assessed to determine whether a polynomial $g'$ can be
calculated from this polynomial and each member of
$h\_set_{x_i}$, as shown in (5),

$f_{e_1,\dots,e_{i-1},e_{i+1},\dots,e_d}(x_i) = g'(x_i) \circ h'(x_i) + f_0(x_i)$,
$h'(x_i) \in h\_set_{x_i}$ ($1 \le i \le d$)   (5)

where $g'(x_i)$ is a new left decomposition factor of
$f_{e_1,\dots,e_{i-1},e_{i+1},\dots,e_d}$, $f_0(x_i)$ is a new indecomposable part
of $f_{e_1,\dots,e_{i-1},e_{i+1},\dots,e_d}$, and $h'(x_i)$ is a member of
$h\_set_{x_i}$. Please note that the members of $h\_set_{x_i}$ may
originate from different members of $AllSubPoly_{x_i}$.

To reduce the cost of the corresponding hardware
implementation of each polynomial, we make use of the
canonical representation of $f_0(x_i)$ from $Z_{2^{n_i}}$ to $Z_{2^m}$, which
is explained in section 2. We then compare the cost of the
canonical form with that of the original form of $f_0(x_i)$ and
select the lower-cost form.

Example 5: Let us consider two members of
$AllSubPoly_x$ in Example 2: $p_1 = x^6+2x^5-x^4-x^3+2x^2-x$, which
belongs to $f_1$, and $p_2 = x^6-2x^4+x^2$, which belongs to $f_2$. Two
members of $h\_set_x$ in Example 4 are $h_1 = x^3+x^2-x$ and
$h_2 = x^3-x$, which are right decomposition factors of $p_1$ and
$p_2$, respectively. By applying the common sub-expression
extraction phase to $p_1$ and $p_2$ over $Z_{2^3}$, the two obtained
forms of these polynomials are as follows.

Form 1:

$p_1 = (x^2-x) \circ h_2 + f_0 = (x^2-x) \circ (x^3-x) + 2x^5+x^4+x^2-2x$,
canonical_form($f_0$) = $2x^2$, so $p_1 = (x^2-x) \circ (x^3-x) + 2x^2$ over $Z_{2^3}$,
$p_2 = x^2 \circ h_2 = x^2 \circ (x^3-x)$.

Form 2:

$p_1 = (x^2+x) \circ h_1 = (x^2+x) \circ (x^3+x^2-x)$,

$p_2 = (x^2+2x) \circ h_1 + f_0 = (x^2+2x) \circ (x^3+x^2-x) - 2x^5 - x^4 - 2x^2 + 2x$,

canonical_form($f_0$) = $-3x^2$, so $p_2 = (x^2+2x) \circ (x^3+x^2-x) - 3x^2$
over $Z_{2^3}$.
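
The re-decomposition of Example 5 can be reproduced with a small sketch. It assumes that the building block $h'$ is monic, so that the system $Ab = a$ of subsection 4.1 is triangular with a unit diagonal and can be solved with integer back substitution; the reduced remainders are then checked by exhaustive evaluation over $Z_{2^3}$ rather than by computing a canonical form. The names `redecompose` and `agree_mod` are hypothetical.

```python
def mul(a, b):
    """Product of two coefficient lists (low-to-high)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def redecompose(p, h):
    """Given p and a monic building block h, return (g, f0) with p = g(h) + f0."""
    s = len(h) - 1
    r = (len(p) - 1) // s
    powers = [[1]]
    for _ in range(r):
        powers.append(mul(powers[-1], h))
    at = lambda poly, k: poly[k] if k < len(poly) else 0
    b = [0] * (r + 1)
    for i in range(r, -1, -1):                 # back substitution on A b = a
        known = sum(b[j] * at(powers[j], i * s) for j in range(i + 1, r + 1))
        b[i] = at(p, i * s) - known
    f0 = list(p)
    for j in range(r + 1):                     # f0 = p - g(h)
        for t, c in enumerate(powers[j]):
            f0[t] -= b[j] * c
    return b, f0

def agree_mod(p, q, bits):
    """True if p and q define the same function on Z_{2^bits}."""
    m = 2 ** bits
    value = lambda poly, x: sum(c * x ** k for k, c in enumerate(poly)) % m
    return all(value(p, x) == value(q, x) for x in range(m))

p1 = [0, -1, 2, -1, -1, 2, 1]   # x^6 + 2x^5 - x^4 - x^3 + 2x^2 - x
p2 = [0, 0, 1, 0, -2, 0, 1]     # x^6 - 2x^4 + x^2
h1 = [0, -1, 1, 1]              # x^3 + x^2 - x
h2 = [0, -1, 0, 1]              # x^3 - x

g_a, f0_a = redecompose(p1, h2)   # Form 1, p1
g_b, f0_b = redecompose(p2, h1)   # Form 2, p2
print(g_a, f0_a)
print(g_b, f0_b)
print(agree_mod(f0_a, [0, 0, 2], 3), agree_mod(f0_b, [0, 0, -3], 3))
```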

Therefore, the result of the common sub-expression
extraction phase is a set of decompositions of each
$f_{e_1,\dots,e_{i-1},e_{i+1},\dots,e_d}$, based on the different building blocks
obtained from different members of $AllSubPoly_{x_i}$.

4.3. Complete system-level optimization algorithm 

The complete proposed system-level optimization
algorithm is explained in this subsection. The pseudo-code of
the proposed method is shown in Fig. 3. Lines 3-10 describe
the first phase of the proposed method, in which each input
multivariate polynomial is transformed to several univariate
polynomials which are stored in $AllSubPoly_{x_i}$ ($1 \le i \le d$)
(lines 3-6). Then the univariate functional decomposition
algorithm in Fig. 2 is applied to these univariate polynomials,
and all obtained right decomposition factors are stored in
$h\_set_{x_i}$ as good building blocks (lines 7-10). Lines 11-16
describe the second phase, in which each member of
$AllSubPoly_{x_i}$ is re-decomposed by the members of $h\_set_{x_i}$
using the g_UniDec procedure in Fig. 2 (lines 11-14), and
common sub-expressions are determined. The canonical form
from $Z_{2^{n_i}}$ to $Z_{2^m}$ is calculated for $f_0(x_i)$ in order to reduce
the cost of the implementation (lines 15-16). Finally, to select
the form with the smallest number of arithmetic operations,
every newly generated form of all members of $AllSubPoly_{x_i}$
is considered, and the related hardware implementation is
evaluated by computing the cost_func function. This function
determines the number of arithmetic operations, such as
additions and multiplications, needed for the implementation
of the given polynomials (lines 17-22).
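
One possible reading of such a cost function for polynomials in expanded form is sketched below. The counting convention is an assumption chosen because it reproduces the 32 multiplications and 10 additions reported for the original system in section 3: every monomial is computed independently, a monomial of total degree $D$ costs $D-1$ multiplications plus one more if its coefficient is not $\pm 1$, and subtractions are counted as additions. The name `count_ops` is hypothetical and is not the cost_func of Fig. 3.

```python
def count_ops(poly):
    """Return (multiplications, additions) for an expanded sparse polynomial.

    poly maps exponent tuples to non-zero integer coefficients.
    """
    mults = 0
    for exps, coeff in poly.items():
        degree = sum(exps)
        if degree == 0:
            continue                  # a constant term needs no multiplication
        mults += degree - 1           # products between variable occurrences
        if abs(coeff) != 1:
            mults += 1                # multiplication by the coefficient
    adds = max(len(poly) - 1, 0)      # one addition or subtraction between terms
    return mults, adds

# Original system of section 3; keys are (exponent of x, exponent of y).
f1 = {(4, 0): 1, (3, 0): 2, (2, 0): 1, (1, 3): 1, (1, 2): -3, (1, 1): 2}
f2 = {(6, 0): 1, (5, 0): 3, (4, 0): 3, (3, 0): 1, (2, 0): 1, (1, 0): 1}

m1, a1 = count_ops(f1)
m2, a2 = count_ops(f2)
print(m1 + m2, a1 + a2)   # total multiplications and additions of the system
```

A cost function for the optimized forms would additionally have to recognize shared sub-expressions so that each is counted once; that part is not covered by this sketch.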



Optimization Algorithm (LP, d)
 1: LP := List of input polynomials (f_1, f_2, ..., f_p)
 2: d := Number of variables
 3: for j from 1 to p
 4:    for i from 1 to d
 5:       consider f_j as f_j := sum_{e_1,..,e_(i-1),e_(i+1),..,e_d >= 0}
             f_{e_1,..,e_(i-1),e_(i+1),..,e_d}(x_i) × x_1^(e_1) × ... × x_(i-1)^(e_(i-1))
             × x_(i+1)^(e_(i+1)) × ... × x_d^(e_d);
 6:       Add every f_{e_1,..,e_(i-1),e_(i+1),..,e_d}(x_i) to AllSubPoly_(x_i);
 7: for i = 1 to d do
 8:    for k from 1 to size(AllSubPoly_(x_i))
 9:       (h, g, f_0) := UniDec(AllSubPoly_(x_i)[k], r);
10:       Add h to h_set_(x_i);
11: for i = 1 to d do
12:    for j from 1 to size(AllSubPoly_(x_i))
13:       for each h' of h_set_(x_i)
14:          (g', f_0) := g_UniDec(AllSubPoly_(x_i)[j], h', r);
15:          if (cost_func(canonical_form(f_0)) < cost_func(f_0))
16:             f_0' = canonical_form(f_0);
17: for j = 1 to p do
18:    f_j' = reconstruct f_j from AllSubPoly_(x_i);
19: for every combination of f_j' (transformation of the original f_j)
20:    if (cost_func(current combination) < min_cost) then
21:       minimum cost combination = current combination
22: return minimum cost combination



Figure 3: System-level optimization algorithm 

Example 6: Let us consider the polynomial system in
Example 2. This polynomial system originally needs 115
multiplications and 23 additions. By applying our proposed
method, we get an implementation with only 16
multiplications and 10 additions as shown below.

$t_1 = x^2$, $t_2 = x(t_1 - 1)$, $t_3 = t_1 + t_2$, $t_4 = y^2$,

$f_1 = t_4^2 t_1 (x+2) + y(y+5) t_3 (t_3+1)$,
$f_2 = (t_1^2 + x)(t_4(2 + t_4) - 2y) + y t_4 t_2^2$.
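
The optimized form of Example 6 can be checked numerically against the expanded system. The sketch below is a sanity check only; the exponents of $f_1$ and $f_2$ are read from their grouped forms in Example 2, and the intermediate names `t1` to `t4` follow the example. The function names are hypothetical.

```python
def original(x, y):
    """Expanded polynomials f_1 and f_2 of Example 2."""
    f1 = (x**6 * y**2 + 5 * x**6 * y + 2 * x**5 * y**2 + 10 * x**5 * y
          - x**4 * y**2 - 5 * x**4 * y + x**3 * y**4 - x**3 * y**2 - 5 * x**3 * y
          + 2 * x**2 * y**4 + 2 * x**2 * y**2 + 10 * x**2 * y - x * y**2 - 5 * x * y)
    f2 = (x**6 * y**3 + x**4 * y**4 - 2 * x**4 * y**3 + 2 * x**4 * y**2
          - 2 * x**4 * y + x**2 * y**3 + x * y**4 + 2 * x * y**2 - 2 * x * y)
    return f1, f2

def optimized(x, y):
    """Shared-subexpression form: 16 multiplications and 10 additions/subtractions."""
    t1 = x * x                      # x^2
    t2 = x * (t1 - 1)               # x^3 - x
    t3 = t1 + t2                    # x^3 + x^2 - x
    t4 = y * y                      # y^2
    f1 = t4 * t4 * t1 * (x + 2) + y * (y + 5) * t3 * (t3 + 1)
    f2 = (t1 * t1 + x) * (t4 * (2 + t4) - 2 * y) + y * t4 * t2 * t2
    return f1, f2

mismatches = [(x, y) for x in range(-3, 4) for y in range(-3, 4)
              if original(x, y) != optimized(x, y)]
print(mismatches)   # an empty list means both forms agree on the sampled grid
```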

The results reported by GAUT for the datapath architectures 
of the original and optimized polynomials as sequential 
circuits, in speed optimization mode, are shown in Table 2. 



Table 2. GAUT report for the original and optimized
polynomials.

             Cycles  Registers  Muxes  FU (+)  FU (-)  FU (x)  Area
Original       15       21       320     3       1      12     1028
Optimized       8       12       224     2       1       4      356



5. Experimental Results 

In order to show the effectiveness of our proposed
optimization method, we have employed different
polynomials extracted from real embedded systems. Various
combinations of multivariate cosine wavelet (MVCS) for
graphic applications [7], Savitzky-Golay (SG) filters [16] and
digital image rejection unit (DIRU) for image processing
applications, Quadratic filters (Quad) for DSP applications
[17], and Phase-Shift Keying (PSK) for digital communication
[18] have been taken into account as multi-output
polynomials.



Table 3. Comparison of the datapath architectures in the
speed optimization mode.

Method         Metric      DIRU   DIRU   DIRU   DIRU     %A
                           PSK    PSK    Quad   MVCS
                           Quad   SG2    SG2    SG2
Horner form    Cycles       17     17     16     16     0.00
               Registers    22     24     20     20
               Muxes       384    352    324    336
               FU (+)        2      2      3      4
               FU (-)        1      1      0      0
               FU (x)       11     13      7      7
               Area        937   1103    605    613
Method in [6]  Cycles       12     15     17     13    13.42
               Registers    19     37     21     24
               Muxes       272    368    306    304
               FU (+)        3      3      3      4
               FU (-)        1      1      0      0
               FU (x)        6     21      7      7
               Area        530   1775    605    613
Method in [1]  Cycles       15     15     11     13    18.32
               Registers    16     15     20     26
               Muxes       243    256    304    288
               FU (+)        2      2      4      4
               FU (-)        1      1      0      0
               FU (x)        5      5      7     15
               Area        439    439    613   1277
Our approach   Cycles       13     13     11     11    27.39
               Registers    17     19     20     22
               Muxes       320    336    288    272
               FU (+)        2      2      2      3
               FU (-)        1      1      0      0
               FU (x)        4      5      7      7
               Area        356    439    597    605


We have implemented the proposed method along with 
the methods in [1] and [6] in Maple [15], and then we have 
used GAUT as a high-level synthesis tool [12] to 
automatically generate datapath architectures for obtained 
polynomials. 

To generate datapath architectures based on optimized 
polynomials, we considered two modes provided by GAUT; 
speed optimization and area optimization. Table 3 illustrates 
the results obtained using our proposed method, Horner 
form, and methods in [1] and [6] for speed optimization 
mode. This table reports area and number of the clock cycles, 
registers, multiplexers, and functional units (adder, 
subtracter, multiplier) in the obtained datapath architectures. 
%A indicates the percent of improvement in the number of 
required clock cycles in all methods compared to the Horner 
form. The results in the table indicate that in our method 
required clock cycles are reduced by an average of 38.11%, 
20.10%, and 12.24% in comparison with the Horner form, 
the method in [6] and the method in [1] across all 
benchmarks. This reduction indicates that our goal of 
reducing critical path delay has been achieved. Furthermore, 
area is improved in our method with an average improvement 
of 25.81% in comparison with other works. 
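
The %A column can be reproduced under one plausible reading: the relative reduction in clock cycles with respect to the Horner form is computed per benchmark and then averaged. The sketch below uses cycle counts from Table 3; the formula is an assumption that matches the 13.42 and 27.39 entries, and the function name `percent_improvement` is hypothetical.

```python
def percent_improvement(baseline_cycles, method_cycles):
    """Mean per-benchmark relative cycle reduction versus the baseline, in percent."""
    ratios = [(b - m) / b for b, m in zip(baseline_cycles, method_cycles)]
    return round(100 * sum(ratios) / len(ratios), 2)

horner_cycles = [17, 17, 16, 16]    # first group of Table 3
second_group = [12, 15, 17, 13]     # group reported with %A = 13.42
last_group = [13, 13, 11, 11]       # group reported with %A = 27.39

print(percent_improvement(horner_cycles, second_group))
print(percent_improvement(horner_cycles, last_group))
```

Averaging the per-benchmark ratios gives every benchmark the same weight regardless of its absolute cycle count, which differs from comparing the summed cycle counts.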



Table 4. Comparison of the datapath architectures in the
area optimization mode.

Method         Metric      DIRU   DIRU   DIRU   DIRU     %A
                           PSK    PSK    Quad   MVCS
                           Quad   SG2    SG2    SG2
Horner form    Cycles       72     86     68     86     0.00
               Registers    13     15     13     16
               Muxes       592    656    560    672
               FU (+)        1      1      1      1
               FU (-)        1      1      0      0
               FU (x)        1      1      1      1
               Area         99     99     91     91
Method in [6]  Cycles       52     91     72     71     8.38
               Registers    16     27     17     22
               Muxes       672   1056    800    832
               FU (+)        1      1      1      1
               FU (-)        1      1      0      0
               FU (x)        1      1      1      1
               Area         99     99     91     91
Method in [1]  Cycles       36     36     55     65    37.92
               Registers    12     11     18     19
               Muxes       411    432    752    832
               FU (+)        1      1      1      1
               FU (-)        1      1      0      0
               FU (x)        1      1      1      1
               Area         99     99     91     91
Our approach   Cycles       41     48     49     52    38.68
               Registers    16     17     18     18
               Muxes       624    704    720    752
               FU (+)        1      1      1      1
               FU (-)        1      1      0      0
               FU (x)        1      1      1      1
               Area         99     99     91     91


In the area optimization mode, only one functional unit is
considered for each operation type existing in the design
(i.e., all operations of the same type are bound to the same
functional unit). Table 4 illustrates the results obtained using
our proposed method, Horner form, and methods in [1] and
[6] for area optimization mode. The results reported in the
table indicate that our proposed method requires fewer
clock cycles in comparison with the Horner form, the method
in [6] and the method in [1], with an average of 64.73%,
49.97%, and 0.01%, respectively, across all benchmarks. Such
an improvement in the required clock cycles with a fixed
number of functional units indicates that, with our method,
the number of operations is lower than in the other works.
In other words, our proposed method can efficiently
determine common sub-expressions.



6. Conclusion

In this paper we have proposed a system-level
optimization method for datapaths implemented using a
system of polynomials. Our method optimizes polynomials
to reduce the complexity of polynomial datapaths in terms of
the number of arithmetic operations over $Z_{2^m}$. In the
proposed method, all given multivariate polynomials are first
transformed to several univariate polynomials. Then
univariate functional decompositions are calculated for them
to obtain good building blocks. To extract common
sub-expressions we make use of the univariate functional
decomposition algorithm. We have used the GAUT high-level
synthesis tool to generate RTL datapath architectures for the
optimized polynomials as sequential circuits. Experimental
results show the superiority of our approach in area and
delay savings compared with other related works. As future
work, we are going to use a multivariate functional
decomposition algorithm to extract better building blocks.